@codeflash-ai codeflash-ai bot commented Nov 19, 2025

📄 50% (0.50x) speedup for generate_id_within_group in datacompy/core.py

⏱️ Runtime: 63.8 milliseconds → 42.6 milliseconds (best of 16 runs)

📝 Explanation and details

The optimization achieves a 49% speedup through three key performance improvements:

1. Reduced DataFrame Subsetting

  • Stores dataframe[join_columns] once as join_df instead of repeatedly accessing it
  • Line profiler shows the original code spent 40.4% of time on the initial null check, reduced to 34.5% in the optimized version
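As a minimal sketch (not datacompy's exact code), the change amounts to slicing the join columns once and reusing that object for every later step:

```python
import pandas as pd

df = pd.DataFrame({"A": ["x", None, "x"], "B": [1, 1, 2]})
join_columns = ["A", "B"]

# Before (sketch): every step re-slices the DataFrame
#   dataframe[join_columns].isnull()... ; dataframe[join_columns].fillna(...)...

# After (sketch): slice once, reuse for the null check and later steps
join_df = df[join_columns]
has_nulls = join_df.isnull().to_numpy().any()
print(has_nulls)  # True
```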

2. Faster Null Detection

  • Replaces .isnull().any().any() with .isnull().to_numpy().any()
  • NumPy's .any() on boolean arrays is significantly faster than pandas' chained .any().any() operations
  • This optimization particularly benefits cases with nulls, showing 56-61% speedups in null-heavy test cases
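The two null checks can be compared side by side; both return the same answer and only their cost differs:

```python
import pandas as pd

df = pd.DataFrame({"A": ["x", None, "y"], "B": [1, 2, None]})

# Original: two chained pandas reductions (a per-column Series, then a
# reduction across columns)
slow = df.isnull().any().any()

# Optimized: one NumPy reduction over the flat boolean array, skipping
# the intermediate pandas Series machinery
fast = df.isnull().to_numpy().any()

print(bool(slow), bool(fast))  # True True
```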

3. Efficient Value Collision Check

  • Uses NumPy array operations (values_array == default_value).any() instead of DataFrame equality checking
  • Avoids expensive DataFrame-wide equality comparisons by operating directly on the underlying NumPy array
  • The ValueError detection cases show dramatic 137-154% speedups
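A sketch of the collision check, assuming the `DATACOMPY_NULL` sentinel used throughout this PR:

```python
import pandas as pd

default_value = "DATACOMPY_NULL"
df = pd.DataFrame({"A": ["x", default_value], "B": [1, 1]})

# DataFrame-wide equality materializes a boolean DataFrame before reducing
collision_slow = (df == default_value).any().any()

# Comparing against the underlying NumPy array avoids that intermediate
values_array = df.to_numpy()
collision_fast = (values_array == default_value).any()

print(bool(collision_slow), bool(collision_fast))  # True True
```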

Impact on Hot Path Usage:
The function is called within _dataframe_merge() when duplicate rows are detected (self._any_dupes is True), creating temporary order columns for unique matching. Since this happens during DataFrame merging operations - a core functionality of the datacompy library - these optimizations will significantly improve performance for datasets with duplicate rows.
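For context, the observable behavior of `generate_id_within_group` (as exercised by the tests below) amounts to a null-safe per-group cumulative count. A rough equivalent, assuming the `DATACOMPY_NULL` sentinel and not the function's exact implementation:

```python
import pandas as pd

df = pd.DataFrame({"A": ["x", "x", "y", None, None], "B": [1, 1, 2, 3, 3]})
join_columns = ["A", "B"]

# Nulls are replaced with a sentinel so they group together instead of
# being dropped by groupby; each row then gets its 0-based position
# within its (A, B) group.
filled = df[join_columns].fillna("DATACOMPY_NULL").astype(str)
ids = filled.groupby(join_columns).cumcount()
print(list(ids))  # [0, 1, 0, 0, 1]
```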

Test Case Performance Patterns:

  • Non-null cases: 25-53% faster (simpler code path benefits from reduced subsetting)
  • Null-heavy cases: 46-61% faster (benefits most from NumPy-based null detection)
  • Error cases with value collisions: 137-154% faster (benefits from efficient NumPy equality checking)

Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 99 Passed |
| 🌀 Generated Regression Tests | 40 Passed |
| ⏪ Replay Tests | 194 Passed |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

⚙️ Existing Unit Tests and Runtime

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|--------------------------|-------------|--------------|---------|
| test_core.py::test_generate_id_within_group | 9.12ms | 6.05ms | 50.7% ✅ |
| test_core.py::test_generate_id_within_group_valueerror | 948μs | 382μs | 148% ✅ |
🌀 Generated Regression Tests and Runtime
import pandas as pd

# imports
import pytest
from datacompy.core import generate_id_within_group

# unit tests

# --- Basic Test Cases ---


def test_single_group_no_nulls():
    # All rows have the same group, no nulls
    df = pd.DataFrame({"A": ["x", "x", "x"], "B": [1, 1, 1]})
    codeflash_output = generate_id_within_group(df, ["A", "B"])
    result = codeflash_output  # 1.28ms -> 928μs (37.5% faster)


def test_multiple_groups_no_nulls():
    # Multiple groups, no nulls
    df = pd.DataFrame({"A": ["x", "y", "x", "y"], "B": [1, 2, 1, 2]})
    codeflash_output = generate_id_within_group(df, ["A", "B"])
    result = codeflash_output  # 1.23ms -> 907μs (35.7% faster)


def test_multiple_groups_with_duplicates():
    # Groups with duplicate rows
    df = pd.DataFrame({"A": ["x", "x", "y", "y", "y"], "B": [1, 1, 2, 2, 2]})
    codeflash_output = generate_id_within_group(df, ["A", "B"])
    result = codeflash_output  # 1.22ms -> 886μs (37.5% faster)


def test_single_column_grouping():
    # Grouping on a single column
    df = pd.DataFrame({"A": ["a", "b", "a", "c", "b"]})
    codeflash_output = generate_id_within_group(df, ["A"])
    result = codeflash_output  # 1.02ms -> 679μs (50.6% faster)


def test_empty_dataframe():
    # Empty dataframe should return empty series
    df = pd.DataFrame({"A": [], "B": []})
    codeflash_output = generate_id_within_group(df, ["A", "B"])
    result = codeflash_output  # 1.02ms -> 705μs (44.6% faster)


# --- Edge Test Cases ---


def test_nulls_in_group_columns():
    # Nulls in join columns
    df = pd.DataFrame({"A": ["x", None, "x", None], "B": [1, 1, None, None]})
    codeflash_output = generate_id_within_group(df, ["A", "B"])
    result = codeflash_output  # 1.75ms -> 1.12ms (56.6% faster)


def test_null_and_duplicate_rows():
    # Nulls and duplicate
    df = pd.DataFrame({"A": [None, None, "a", "a"], "B": [1, 1, 2, 2]})
    codeflash_output = generate_id_within_group(df, ["A", "B"])
    result = codeflash_output  # 1.73ms -> 1.08ms (59.8% faster)


def test_null_and_DATACOMPY_NULL_value():
    # Nulls and DATACOMPY_NULL present
    df = pd.DataFrame({"A": [None, "DATACOMPY_NULL"], "B": [1, 1]})
    # Should raise ValueError
    with pytest.raises(ValueError):
        generate_id_within_group(df, ["A", "B"])  # 899μs -> 354μs (154% faster)


def test_all_nulls_in_group_columns():
    # All group columns are null
    df = pd.DataFrame({"A": [None, None], "B": [None, None]})
    codeflash_output = generate_id_within_group(df, ["A", "B"])
    result = codeflash_output  # 1.62ms -> 1.02ms (58.4% faster)


def test_mixed_types_in_group_columns():
    # Mixed types in group columns
    df = pd.DataFrame({"A": ["x", 1, None, "x"], "B": [None, 2, 2, None]})
    codeflash_output = generate_id_within_group(df, ["A", "B"])
    result = codeflash_output  # 1.74ms -> 1.10ms (58.6% faster)


def test_group_column_with_nan_and_none():
    # np.nan and None both present
    import numpy as np

    df = pd.DataFrame({"A": [np.nan, None, "x"], "B": [1, 1, 1]})
    codeflash_output = generate_id_within_group(df, ["A", "B"])
    result = codeflash_output  # 1.75ms -> 1.11ms (58.1% faster)


def test_group_column_with_empty_string_and_null():
    # Empty string and null are different
    df = pd.DataFrame({"A": ["", None, ""], "B": [1, 1, 1]})
    codeflash_output = generate_id_within_group(df, ["A", "B"])
    result = codeflash_output  # 1.75ms -> 1.10ms (59.6% faster)


def test_group_column_with_all_DATACOMPY_NULL():
    # All group columns are DATACOMPY_NULL but no nulls
    df = pd.DataFrame({"A": ["DATACOMPY_NULL", "DATACOMPY_NULL"], "B": [1, 1]})
    # No nulls, so should work
    codeflash_output = generate_id_within_group(df, ["A", "B"])
    result = codeflash_output  # 1.20ms -> 889μs (34.5% faster)


def test_group_column_with_only_one_row():
    # Only one row
    df = pd.DataFrame({"A": ["a"], "B": [1]})
    codeflash_output = generate_id_within_group(df, ["A", "B"])
    result = codeflash_output  # 1.20ms -> 864μs (38.4% faster)


def test_group_column_with_non_string_types():
    # Non-string types in group columns
    df = pd.DataFrame({"A": [1, 2, 1, 2], "B": [3.0, 3.0, 3.0, 3.0]})
    codeflash_output = generate_id_within_group(df, ["A", "B"])
    result = codeflash_output  # 1.19ms -> 861μs (38.5% faster)


def test_group_column_with_boolean_types():
    # Boolean types in group columns
    df = pd.DataFrame({"A": [True, False, True, False], "B": [1, 1, 1, 1]})
    codeflash_output = generate_id_within_group(df, ["A", "B"])
    result = codeflash_output  # 1.20ms -> 880μs (36.7% faster)


# --- Large Scale Test Cases ---


def test_large_dataframe_many_groups():
    # Large dataframe with many groups
    n = 500
    df = pd.DataFrame(
        {
            "A": ["g" + str(i % 10) for i in range(n)],
            "B": [i % 5 for i in range(n)],
        }
    )
    codeflash_output = generate_id_within_group(df, ["A", "B"])
    result = codeflash_output  # 1.30ms -> 942μs (37.8% faster)
    # Within each (A, B) group, IDs should be exactly 0..len(group)-1.
    # (Here A = "g" + str(i % 10) determines B = i % 5, so there are 10
    # distinct groups of 50 rows each.)
    ids_by_group = {}
    for i, (a, b) in enumerate(zip(df["A"], df["B"], strict=False)):
        key = (a, b)
        ids_by_group.setdefault(key, []).append(result.iloc[i])
    for id_list in ids_by_group.values():
        assert sorted(id_list) == list(range(len(id_list)))


def test_large_dataframe_with_nulls():
    # Large dataframe with some nulls
    n = 500
    A = ["g" + str(i % 10) if i % 50 != 0 else None for i in range(n)]
    B = [i % 5 if i % 33 != 0 else None for i in range(n)]
    df = pd.DataFrame({"A": A, "B": B})
    codeflash_output = generate_id_within_group(df, ["A", "B"])
    result = codeflash_output  # 2.04ms -> 1.37ms (48.9% faster)
    # Check that all groups have monotonically increasing IDs starting at 0
    # fillna must run before astype(str), otherwise nulls become the
    # strings "None"/"nan" and the fill is a no-op
    groups = df[["A", "B"]].fillna("DATACOMPY_NULL").astype(str).apply(tuple, axis=1)
    id_map = {}
    for idx, grp in enumerate(groups):
        id_map.setdefault(grp, []).append(result.iloc[idx])
    for ids in id_map.values():
        assert ids == list(range(len(ids)))


def test_large_single_group():
    # Large dataframe, single group
    n = 1000
    df = pd.DataFrame({"A": ["x"] * n, "B": [1] * n})
    codeflash_output = generate_id_within_group(df, ["A", "B"])
    result = codeflash_output  # 1.32ms -> 998μs (31.8% faster)


def test_large_dataframe_all_nulls():
    # Large dataframe, all nulls
    n = 1000
    df = pd.DataFrame({"A": [None] * n, "B": [None] * n})
    codeflash_output = generate_id_within_group(df, ["A", "B"])
    result = codeflash_output  # 2.04ms -> 1.39ms (46.5% faster)


def test_large_dataframe_some_DATACOMPY_NULL_and_null():
    # Large dataframe with both DATACOMPY_NULL and null
    n = 500
    A = ["DATACOMPY_NULL" if i == 0 else None for i in range(n)]
    B = [1] * n
    df = pd.DataFrame({"A": A, "B": B})
    # Should raise ValueError
    with pytest.raises(ValueError):
        generate_id_within_group(df, ["A", "B"])  # 928μs -> 377μs (146% faster)


def test_large_dataframe_unique_rows():
    # Large dataframe, all unique groups
    n = 1000
    df = pd.DataFrame({"A": [f"a{i}" for i in range(n)], "B": [i for i in range(n)]})
    codeflash_output = generate_id_within_group(df, ["A", "B"])
    result = codeflash_output  # 1.68ms -> 1.33ms (25.8% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import pandas as pd

# imports
import pytest
from datacompy.core import generate_id_within_group

# unit tests

# ------------------- Basic Test Cases -------------------


def test_basic_single_group_no_duplicates():
    """Test single group, no duplicate rows, should return all zeros."""
    df = pd.DataFrame({"A": [1, 1, 1], "B": [2, 2, 2]})
    codeflash_output = generate_id_within_group(df, ["A", "B"])
    result = codeflash_output  # 1.14ms -> 813μs (40.2% faster)


def test_basic_multiple_groups():
    """Test multiple groups, cumcount resets per group."""
    df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [2, 2, 3, 3]})
    codeflash_output = generate_id_within_group(df, ["A"])
    result = codeflash_output  # 1.05ms -> 686μs (52.7% faster)


def test_basic_multiple_columns_grouping():
    """Test grouping by multiple columns."""
    df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [2, 3, 2, 3]})
    codeflash_output = generate_id_within_group(df, ["A", "B"])
    result = codeflash_output  # 1.14ms -> 777μs (47.0% faster)


def test_basic_with_duplicates_in_groups():
    """Test cumcount increments within each group with duplicates."""
    df = pd.DataFrame({"A": ["x", "x", "y", "y", "x"], "B": [1, 1, 2, 2, 1]})
    codeflash_output = generate_id_within_group(df, ["A", "B"])
    result = codeflash_output  # 1.23ms -> 909μs (35.6% faster)


def test_basic_empty_dataframe():
    """Test empty DataFrame returns empty Series."""
    df = pd.DataFrame({"A": [], "B": []})
    codeflash_output = generate_id_within_group(df, ["A", "B"])
    result = codeflash_output  # 1.04ms -> 689μs (51.0% faster)


def test_basic_single_row():
    """Test DataFrame with a single row."""
    df = pd.DataFrame({"A": [1], "B": [2]})
    codeflash_output = generate_id_within_group(df, ["A", "B"])
    result = codeflash_output  # 1.12ms -> 781μs (42.8% faster)


# ------------------- Edge Test Cases -------------------


def test_edge_nulls_in_group_columns():
    """Test DataFrame with nulls in join columns."""
    df = pd.DataFrame({"A": [1, None, 1, None], "B": [2, 2, None, None]})
    codeflash_output = generate_id_within_group(df, ["A", "B"])
    result = codeflash_output  # 1.68ms -> 1.04ms (61.0% faster)


def test_edge_nulls_and_duplicates():
    """Test with nulls and duplicate rows."""
    df = pd.DataFrame({"A": [None, None, 1, 1, None], "B": [2, 2, 2, 2, 2]})
    codeflash_output = generate_id_within_group(df, ["A", "B"])
    result = codeflash_output  # 1.75ms -> 1.10ms (59.2% faster)


def test_edge_null_and_default_value_collision():
    """Test that ValueError is raised if 'DATACOMPY_NULL' is present in join columns."""
    df = pd.DataFrame({"A": [None, "DATACOMPY_NULL"], "B": [1, 1]})
    with pytest.raises(ValueError):
        generate_id_within_group(df, ["A", "B"])  # 906μs -> 357μs (154% faster)


def test_edge_non_string_columns():
    """Test with non-string columns (int, float, bool)."""
    df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [True, True, False, False]})
    codeflash_output = generate_id_within_group(df, ["A", "B"])
    result = codeflash_output  # 1.18ms -> 850μs (39.1% faster)


def test_edge_all_nulls():
    """Test when all join columns are null."""
    df = pd.DataFrame({"A": [None, None, None], "B": [None, None, None]})
    codeflash_output = generate_id_within_group(df, ["A", "B"])
    result = codeflash_output  # 1.63ms -> 1.02ms (58.8% faster)


def test_edge_column_not_in_dataframe():
    """Test when join_columns contains a column not in dataframe."""
    df = pd.DataFrame({"A": [1, 2, 3]})
    with pytest.raises(KeyError):
        generate_id_within_group(df, ["A", "B"])  # 203μs -> 204μs (0.162% slower)


def test_edge_column_with_nan_and_default_value():
    """Test with both NaN and 'DATACOMPY_NULL' in join columns."""
    df = pd.DataFrame({"A": [float("nan"), "DATACOMPY_NULL", 1], "B": [1, 1, 1]})
    with pytest.raises(ValueError):
        generate_id_within_group(df, ["A", "B"])  # 1.02ms -> 432μs (137% faster)


# ------------------- Large Scale Test Cases -------------------


def test_large_scale_many_groups():
    """Test with many groups, each with multiple members."""
    n_groups = 500
    n_per_group = 2
    df = pd.DataFrame(
        {
            "A": [i for i in range(n_groups) for _ in range(n_per_group)],
            "B": [j for _ in range(n_groups) for j in (0, 1)],
        }
    )
    codeflash_output = generate_id_within_group(df, ["A"])
    result = codeflash_output  # 1.16ms -> 811μs (43.4% faster)


def test_large_scale_all_unique():
    """Test with all rows unique, should return all zeros."""
    df = pd.DataFrame({"A": list(range(1000)), "B": list(range(1000))})
    codeflash_output = generate_id_within_group(df, ["A", "B"])
    result = codeflash_output  # 1.26ms -> 943μs (33.9% faster)


def test_large_scale_all_identical():
    """Test with all rows identical, cumcount increments for each row."""
    df = pd.DataFrame({"A": [1] * 1000, "B": [2] * 1000})
    codeflash_output = generate_id_within_group(df, ["A", "B"])
    result = codeflash_output  # 1.20ms -> 858μs (39.8% faster)


def test_large_scale_with_nulls():
    """Test large DataFrame with nulls in join columns."""
    df = pd.DataFrame(
        {
            "A": [None if i % 10 == 0 else i for i in range(1000)],
            "B": [i % 5 for i in range(1000)],
        }
    )
    codeflash_output = generate_id_within_group(df, ["A", "B"])
    result = codeflash_output  # 2.70ms -> 2.10ms (28.7% faster)
    # All null-A rows fall into one group here (A is null when i % 10 == 0,
    # and then B = i % 5 == 0), so their IDs should count up 0, 1, 2, ...
    null_group = df[(df["A"].isnull()) & (df["B"] == 0)].index
    assert list(result[null_group]) == list(range(len(null_group)))


def test_large_scale_performance():
    """Test that function completes in reasonable time for large input."""
    df = pd.DataFrame(
        {"A": [i % 50 for i in range(1000)], "B": [i % 20 for i in range(1000)]}
    )
    codeflash_output = generate_id_within_group(df, ["A", "B"])
    result = codeflash_output  # 1.20ms -> 875μs (36.9% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
⏪ Replay Tests and Runtime

To edit these changes, run `git checkout codeflash/optimize-generate_id_within_group-mi5tvkj8` and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 19, 2025 09:56
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Nov 19, 2025